What is modeling?
Can mean 3 things:
Error & cost:
Features:
Tables
Structured Data
Unstructured Data
Common types of structured data
Applied to credit loan problem:
m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
x_i1 = credit score of person i
x_i2 = income of person i
y_i = response for data point i
y_i = 1 if data point i is blue, -1 if data point i is red
line: (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0
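The separating line above can be sketched as a classifier: compute a_1*x_1 + ... + a_n*x_n + a_0 and check which side of the line the point falls on. The coefficient values below are made up for illustration.

```python
import numpy as np

# Hypothetical coefficients for a 2-attribute (credit score, income) classifier
a = np.array([0.8, 0.5])   # a_1, a_2 (made-up values)
a0 = -1.0                  # intercept a_0

def classify(x):
    """Return 1 (blue) if the point is on/above the line, -1 (red) otherwise."""
    return 1 if a @ x + a0 >= 0 else -1

print(classify(np.array([2.0, 1.0])))   # 0.8*2 + 0.5*1 - 1 = 1.1 >= 0, so 1
print(classify(np.array([0.5, 0.5])))   # 0.8*0.5 + 0.5*0.5 - 1 = -0.35, so -1
```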
Notes:
Scaling linearly:
Scaling to a normal distribution:
Which method to use? Depends:
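The two scaling methods above can be sketched side by side on a made-up attribute column:

```python
import numpy as np

x = np.array([50.0, 60.0, 70.0, 80.0, 100.0])  # e.g. incomes in $k (made-up data)

# Scaling linearly (min-max): maps the range to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Scaling to a normal distribution (z-score): mean 0, standard deviation 1
x_z = (x - x.mean()) / x.std()

print(x_minmax)               # first value becomes 0.0, last becomes 1.0
print(x_z.mean(), x_z.std())  # approximately 0.0 and 1.0
```

Min-max keeps the shape of the distribution but is sensitive to outliers; z-scoring centers the data, which many distance-based methods expect.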
Idea:
Things to keep in mind:
Data has 2 types of patterns:
"Fitting" matches both:
How to measure a model's performance:
What if we want to compare 5 runs of SVM and KNN?
Problems:
Solution:
How much data to use?
How to split data?
Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.
Each part is used k-1 times for training, and 1 time for validation. A common choice of k is 10. Average performance across the k splits. Then train the model again using all the data.
Definition: Grouping data points
Use:
Distance in n dimensions can be measured with a p-norm; the infinity norm is the limit when p is set to infinity.
Why use infinity norms?
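A quick sketch of how the common p-norm distances compare on the same pair of points, including the infinity norm (which reduces to the largest single attribute difference):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
d = a - b  # attribute-wise differences: [-3, 2, 0]

p1 = np.abs(d).sum()        # p = 1 (rectilinear/Manhattan distance): 5.0
p2 = np.sqrt((d**2).sum())  # p = 2 (Euclidean distance): sqrt(13)
pinf = np.abs(d).max()      # p -> infinity: largest single difference, 3.0
```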
How it works:
Pick k cluster centers within the range of the data.
Characteristics:
heuristic: Fast, good, but not guaranteed to find the absolute best solution. The same applies to the choice of k. Try several values of k for optimization. k == # of data points may be the most theoretically optimal, but does that actually make sense for the task? Plot k vs. total distance to find the "elbow" of the curve: at a certain number, the benefit of adding another k becomes negligible.
Classification task: Given a new data point, determine which cluster the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.
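The k-means heuristic described above (pick centers, assign points to the nearest center, move each center to the mean of its points, repeat) can be sketched as follows. The data here is made up: two obvious blobs.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means heuristic: fast, but not guaranteed to find the best solution."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # k random data points
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # Move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Two well-separated blobs (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
centers, labels = kmeans(X, k=2)
```

Classifying a new point is then just "assign to the nearest centroid", the same argmin step used inside the loop.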
Another classification task: What range of possible data points would we assign to each cluster?
Image of cluster region, aka "Voronoi diagram"
Data Preparation: Quantitative Examples
Things to watch out for before building models:
Definition: Data point that's very different from the rest.
Depends on the data:
Actions:
Definition: Determining whether something has changed.
Why:
Definition: Detect increase/decrease or both by the cumulative sum.
How:
Terms:
Formula to detect increase (True if $S_t \ge T$)
$$ S_t = \max\left\{0, S_{t-1} + (X_t-\mu-C) \right\} $$
Formula to detect decrease (True if $S_t \ge T$); $\mu$ and $X_t$ are swapped:
$$ S_t = \max\left\{0, S_{t-1} + (\mu-X_t-C) \right\} $$
Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time; if it ever crosses the threshold line $T$, CUSUM has detected a change.
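The increase-detection recursion can be sketched directly from the formula; the series, $\mu$, $C$, and $T$ below are made-up illustration values:

```python
def cusum_increase(xs, mu, C, T):
    """CUSUM for an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)); flag when S_t >= T."""
    S, flags = 0.0, []
    for x in xs:
        S = max(0.0, S + (x - mu - C))
        flags.append(S >= T)
    return flags

# Made-up series: stable around 10, then shifts up to about 13
xs = [10, 10.2, 9.8, 10.1, 13, 13.2, 12.9, 13.1]
flags = cusum_increase(xs, mu=10, C=0.5, T=4)
print(flags)  # the shift is flagged from the 6th observation onward
```

The decrease detector is the same loop with `(mu - x - C)` in place of `(x - mu - C)`.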
Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.
Example:
Time series complexities
Trends
Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$
Trend at time $t$ based on the delta between baseline estimates $S_t$ and $S_{t-1}$, with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$
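The two trend equations can be iterated together in a short loop (double exponential smoothing). Starting with $S_1 = x_1$ and $T_1 = 0$ is an assumption for the sketch:

```python
def holt(xs, alpha, beta):
    """Exponential smoothing with trend: baseline S_t and trend T_t per the equations above."""
    S, T = xs[0], 0.0            # simple starting conditions (an assumption)
    out = [(S, T)]
    for x in xs[1:]:
        S_prev = S
        S = alpha * x + (1 - alpha) * (S_prev + T)
        T = beta * (S - S_prev) + (1 - beta) * T
        out.append((S, T))
    return out

print(holt([10, 12], alpha=0.5, beta=0.5))  # [(10, 0.0), (11.0, 0.5)]
```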
Cyclic
Two ways to calculate:
Baseline formula (including trend + seasonality)
$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$
Update the seasonal (cyclic) factor in a similar way as trends:
$$C_t = \gamma\left(\frac{x_t}{S_t}\right) + (1 - \gamma)C_{t-L}$$
Example: Sales trends
Starting Conditions
For trend:
For multiplicative seasonality
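Putting the baseline, trend, and multiplicative seasonality updates together gives triple exponential smoothing. The starting conditions below (baseline = first value, trend = 0, seasonal factors = first cycle divided by its mean) are one simple choice, not the only one:

```python
def holt_winters(xs, L, alpha, beta, gamma):
    """Triple exponential smoothing with multiplicative seasonality of cycle length L."""
    first_cycle_mean = sum(xs[:L]) / L
    C = [x / first_cycle_mean for x in xs[:L]]   # initial seasonal factors (assumption)
    S, T = xs[0], 0.0                            # initial baseline and trend (assumption)
    baselines = []
    for t in range(L, len(xs)):
        S_prev = S
        S = alpha * xs[t] / C[t - L] + (1 - alpha) * (S_prev + T)
        T = beta * (S - S_prev) + (1 - beta) * T
        C.append(gamma * xs[t] / S + (1 - gamma) * C[t - L])
        baselines.append(S)
    return baselines, C

# Made-up seasonal series with cycle length 4
b, C = holt_winters([10, 20, 10, 20, 12, 22, 12, 22], L=4,
                    alpha=0.5, beta=0.3, gamma=0.2)
```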
"Smoothing"
Graph of what it looks like:
"Exponential"
Each time period estimate can be plugged in like this:
Given basic exponential smoothing equation
$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$
We want to make a prediction for period $t+1$. Since $x_{t+1}$ is unknown, replace it with $S_t$.
Using $S_t$, the forecast for time period $t+1$ is
$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$
Hence, our forecast is the same as our latest baseline estimate:
$$F_{t+1} = S_t$$
Factoring in trend/cycle
The above equation can be extrapolated to trend/cycle calculations.
Best estimate of trend is the most current trend estimate:
$$F_{t+1} = S_t + T_t$$
Same for cycle (multiplicative seasonality)
$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$
More generally, $F_{t+k} = (S_t + kT_t)C_{(t+1)-L+(k-1)}$ for $k = 1, 2, \dots$
3 key parts
1. Differences
For example:
$$D_{(1)t} = (x_t - x_{t-1})$$
$$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$
$$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$
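These nested differences are exactly repeated first differences, so they can be computed directly; note how differencing flattens a quadratic series:

```python
import numpy as np

x = np.array([3.0, 5.0, 9.0, 15.0, 23.0])  # made-up series with a quadratic trend

d1 = np.diff(x, n=1)   # first differences:  [2, 4, 6, 8]
d2 = np.diff(x, n=2)   # second differences: [2, 2, 2]  (constant: trend removed)
d3 = np.diff(x, n=3)   # third differences:  [0, 0]
```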
2. Autoregression
Definition: Predicting the current value based on previous time periods' values.
Autoregression's exponential smoothing:
Order-p autoregressive model:
"ARIMA" combines autoregression and differencing
3. Moving Average
ARIMA (p,d,q) model
$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)t-i} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$
Choose:
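A minimal sketch of the differencing (I) and autoregression (AR) parts: difference the series d times, then fit the order-p autoregression by least squares. The MA term (the $\theta_i$ sum) is omitted here for brevity, and the function name is made up:

```python
import numpy as np

def fit_ar_on_differences(x, p, d):
    """Fit D_(d)t = mu + sum(alpha_i * D_(d)t-i) by least squares (MA part omitted)."""
    Dx = np.diff(x, n=d)
    # Design matrix: row for time t holds [1, Dx[t-1], ..., Dx[t-p]]
    rows = [np.concatenate(([1.0], Dx[t - p:t][::-1])) for t in range(p, len(Dx))]
    A = np.array(rows)
    y = Dx[p:]
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs  # [mu, alpha_1, ..., alpha_p]

# Made-up series whose first differences follow D_t = 1 + 0.5 * D_{t-1} exactly
x = np.array([0.0, 0.0, 1.0, 2.5, 4.25, 6.125, 8.0625])
print(fit_ar_on_differences(x, p=1, d=1))  # recovers mu = 1.0, alpha_1 = 0.5
```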
Other flavors of ARIMA
Definition: Estimate or forecast the variance of something, given time-series data.
Motivation:
Mathematical Model:
$$ \sigma_t^2 = \omega + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 + \sum_{i=1}^{q} \gamma_i\epsilon_{t-i}^2 $$
What it explains:
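The variance recursion can be sketched as a plain loop over the error series; seeding the initial variances with $\omega$ is an assumption of this sketch, not part of the model:

```python
def garch_variance(eps, omega, betas, gammas):
    """GARCH(p, q): sigma_t^2 = omega + sum(beta_i * sigma_{t-i}^2) + sum(gamma_i * eps_{t-i}^2)."""
    p, q = len(betas), len(gammas)
    sig2 = [omega] * max(p, q)   # starting variances (an assumption)
    for t in range(max(p, q), len(eps)):
        s = omega
        s += sum(b * sig2[t - i] for i, b in enumerate(betas, start=1))
        s += sum(g * eps[t - i] ** 2 for i, g in enumerate(gammas, start=1))
        sig2.append(s)
    return sig2

# GARCH(1,1) on made-up errors: a large error raises the estimated variance
sig2 = garch_variance([1.0, 2.0, 0.0], omega=0.1, betas=[0.5], gammas=[0.3])
```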
Definition: Linear regression with one predictor
Example:
Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (a_0 + a_1 x_{i1}))^2 $$
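Minimizing this sum of squared errors is an ordinary least-squares fit; a sketch on made-up points that lie on the line $y = 2 + 3x$:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 8.0, 11.0])   # exactly y = 2 + 3x (made-up data)

# Solve for (a_0, a_1) minimizing sum((y_i - (a_0 + a_1 * x_i))^2)
A = np.column_stack([np.ones_like(x), x])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)
sse = ((y - (a0 + a1 * x)) ** 2).sum()
print(a0, a1, sse)   # intercept ~2, slope ~3, SSE ~0
```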
Best-fit regression line
How it works:
AIC applied to regression
Equation:
Example:
Relative likelihood $$ e^{\frac{AIC_1 - AIC_2}{2}} $$
Applied to Models 1 & 2: $$ e^{\frac{75 - 80}{2}} \approx 8.2\% $$
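The calculation for Models 1 & 2 can be checked directly; the function name here is made up for illustration:

```python
import math

def relative_likelihood(aic_lower, aic_higher):
    """Relative likelihood that the higher-AIC model minimizes information
    loss as well as the lower-AIC model: exp((AIC_lower - AIC_higher) / 2)."""
    return math.exp((aic_lower - aic_higher) / 2)

rl = relative_likelihood(75, 80)
print(round(rl, 3))   # exp(-2.5), about 0.082, i.e. roughly 8.2%
```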
Result:
Characteristics:
BIC Metrics - Rule of thumb